Add exception handler on HTTP/2 parent channel to suppress WARN logs#48890
Conversation
Pull request overview
Adds a Netty channel handler to suppress noisy “exceptionCaught reached tail of pipeline” WARN logs on HTTP/2 parent (TCP) connections in Cosmos’ Reactor Netty transport, while preserving WARN-level signal when exceptions may impact in-flight HTTP/2 streams.
Changes:
- Install an HTTP/2 parent-channel `exceptionCaught` handler from `ReactorNettyClient` when HTTP/2 is enabled.
- Add `Http2ParentChannelExceptionHandler`, which consumes parent-channel exceptions and logs at DEBUG vs. WARN based on active stream count and channel activity.
- Add EmbeddedChannel-based unit tests covering exception consumption behavior, and update the changelog entry.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/ReactorNettyClient.java | Adds logic to install the new handler onto the HTTP/2 parent channel pipeline. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/Http2ParentChannelExceptionHandler.java | New handler that consumes parent-channel exceptions and logs based on connection state. |
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Documents the fix in the unreleased section. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/http/Http2ParentChannelExceptionHandlerTest.java | New unit tests validating the handler’s exception consumption behavior. |
| sdk/cosmos/azure-cosmos-tests/pom.xml | Enables surefire tests and includes trailing whitespace changes. |
/azp run java - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).
In HTTP/2, reactor-netty multiplexes streams on a shared parent TCP connection. The parent channel pipeline has no `ChannelOperationsHandler` (unlike HTTP/1.1), so TCP-level exceptions like `Connection reset by peer` (ECONNRESET) propagate to Netty's TailContext, which logs them as WARN. This adds `Http2ParentChannelExceptionHandler` to the parent channel via `doOnConnected` (accessing `channel.parent()`). The handler consumes exceptions at DEBUG level WITHOUT closing the channel or altering the connection lifecycle, matching HTTP/1.1 logging behavior.

Changes:
- Handler logs `cause.toString()` (not `getMessage()`) for null-safe diagnostics
- Defensive try-catch for duplicate handler name on concurrent stream creation
- Before/after verified with EmbeddedChannel unit tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…toString(), update changelog

Force-pushed from d68fa5c to 2a3b5b2.
/azp run java - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).
@sdkReviewAgent |
Address Bhaskar's review: add two tests covering the else branch where `activeStreams > 0` on an active channel, exercising the WARN log path.
- `withHandler_activeStreams_consumedAtWarn`: creates an active H2 stream via `codec.connection().local().createStream()`, fires an exception, and verifies it is consumed (does not reach TailContext).
- `withHandler_activeStreams_channelNotClosed`: same setup, verifies the handler does not close the channel even with active streams.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
✅ Review complete (32:05) No new comments — existing review coverage is sufficient. Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage
When Http2FrameCodec is absent from the pipeline, getActiveStreamCount() returns -1. Since -1 != 0 and channelActive == true, the handler takes the safe WARN path. This test covers that fallback behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
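A later revision in this PR replaced the `-1` sentinel with a nullable `Integer`; the resulting DEBUG/WARN decision can be sketched as a small pure function. This is an illustrative model only — the class and method names are made up, not the SDK's API:

```java
// Sketch of the connection-state-based log-level decision discussed in this
// PR. Illustrative names; the real logic lives in
// Http2ParentChannelExceptionHandler (azure-cosmos).
public final class Http2LogLevelSketch {
    /**
     * @param activeStreams active HTTP/2 stream count, or null when the
     *                      Http2FrameCodec could not be found (unknown state)
     * @param channelActive whether the parent channel is still active
     * @return "DEBUG" when the exception cannot affect in-flight streams,
     *         "WARN" otherwise (including the unknown-state fallback)
     */
    public static String logLevel(Integer activeStreams, boolean channelActive) {
        if (activeStreams == null) {
            return "WARN"; // unknown stream count -> take the safe WARN path
        }
        // Either condition alone suffices for DEBUG (OR, not AND).
        if (activeStreams == 0 || !channelActive) {
            return "DEBUG";
        }
        return "WARN";
    }
}
```

With the codec absent (`null` stream count), the sketch falls through to WARN, matching the fallback behavior the test above covers.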
/azp run java - cosmos - tests

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

/azp run java - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).
…ug log in catch
- Change `getActiveStreamCount()` to return `Integer` (nullable) instead of `int` with a `-1` sentinel. `null` explicitly means "could not determine" and takes the safe WARN path. (Addresses Fabian's review)
- Add `logger.debug` in the catch block so codec retrieval failures are observable instead of silently swallowed.
- Add an `Error` guard in `exceptionCaught`: `Error` types (OOM, SOF) propagate to TailContext instead of being consumed. (Addresses Xinlian's review)
- Add `withHandler_errorNotConsumed_propagatesToTail` test.
- Update Javadoc to reflect Exception-only consumption and Error passthrough.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
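The `Error` guard described in this commit amounts to a single type check. A minimal sketch, with a made-up `consumes` helper standing in for the real `exceptionCaught` logic:

```java
// Illustrative sketch of the Error guard: Error must escape to Netty's
// TailContext for standard handling, while plain Exceptions are consumed
// and logged by the handler. "consumes" is a made-up name.
public final class ErrorGuardSketch {
    public static boolean consumes(Throwable cause) {
        if (cause instanceof Error) {
            // Propagate (in Netty: ctx.fireExceptionCaught(cause)) so that
            // OutOfMemoryError / StackOverflowError are never suppressed.
            return false;
        }
        // Consume; log at DEBUG or WARN based on connection state.
        return true;
    }
}
```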
/azp run java - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).
In reactor-netty's H2 path, `doOnConnected` fires once per TCP connection and `connection.channel()` IS the parent channel (`channel.parent()` is null). The previous code assumed `doOnConnected` fires for child/stream channels where `channel.parent()` would return the TCP parent.

Fix: resolve the H2 parent as `channel.parent() ?? channel`, handling both the observed case (parent == null, channel IS the parent) and the alternate case (parent != null, install on parent).

Verified with integration test:
- Linux/epoll with TCP RST proxy (SO_LINGER=0, 30s idle timeout)
- 4.79.1 baseline: TailContext WARN appeared (`Connection reset by peer`)
- Fixed build: WARN suppressed, handler logged at DEBUG (activeStreams=0)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
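The `channel.parent() ?? channel` resolution from this commit (later simplified once `doOnConnected` was confirmed to always hand over the parent) can be sketched without Netty types. `Chan` and `resolveParent` are illustrative stand-ins, not Netty's API:

```java
// Toy model of the parent-channel resolution: a null-coalescing fallback
// over a minimal stand-in for io.netty.channel.Channel's parent() accessor.
public final class ParentResolutionSketch {
    interface Chan { Chan parent(); }

    // H2 path: doOnConnected may hand us the parent TCP channel directly
    // (parent() == null) or a child whose parent() is the TCP channel.
    static Chan resolveParent(Chan channel) {
        return channel.parent() != null ? channel.parent() : channel;
    }

    public static void main(String[] args) {
        Chan parent = () -> null;   // observed case: channel IS the parent
        Chan child = () -> parent;  // alternate case: install on the parent
        System.out.println(resolveParent(parent) == parent);
        System.out.println(resolveParent(child) == parent);
    }
}
```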
/azp run java - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).
doOnConnected fires for the parent TCP channel in reactor-netty's H2 path, so connection.channel() IS the parent. No need for channel.parent() resolution. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
/azp run java - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).
- Add local/remote address to WARN and DEBUG log messages for diagnostic parity with RNTBD connection loggers
- Mark handler `@ChannelHandler.Sharable` with a singleton `INSTANCE` (handler is stateless - no instance fields)
- Update `ReactorNettyClient` to use `INSTANCE` instead of `new`
- Update tests to use `INSTANCE`

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Matches `PartitionProcessor`/`HealthChecker` patterns - avoids SLF4J inline formatting issues. `Channel.toString()` provides L:/R: addresses. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
/azp run java - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).
Resolve vmId lazily via `ClientTelemetry.getMachineId(null)` on first access from a non-event-loop thread. Store as an immutable field in the `@Sharable` handler singleton. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove the lazy singleton pattern (`getOrCreateInstance`) that could call `ClientTelemetry.getMachineId()` on the Netty event loop (5s blocking). Instead, create the handler eagerly in `configureChannelPipelineHandlers()`, which runs on the caller's setup thread. The `@Sharable` handler instance is captured by the `doOnConnected` lambda. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove all blocking calls. Add `ClientTelemetry.getCachedMachineId()`, which reads a volatile field populated by `getMachineId()` during client init. The handler reads it at log time - a pure volatile read, zero blocking. Restores the static `INSTANCE` singleton (handler is stateless again). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
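The non-blocking cached-read pattern from this commit can be sketched in a few lines. Class and method names below are illustrative; only the pattern (write once during init, volatile read at log time) comes from the commit description:

```java
// Sketch of the cached machine-id pattern: the expensive lookup runs once
// during client initialization, and log-time access is a pure volatile read
// that is safe to perform on a Netty event-loop thread.
public final class CachedMachineIdSketch {
    // null until initialization completes; volatile guarantees visibility.
    private static volatile String cachedMachineId;

    // Called from client init, where blocking (up to ~5s) is acceptable.
    static void initMachineId(String resolved) {
        cachedMachineId = resolved;
    }

    // Pure volatile read - zero blocking, callable from the event loop.
    static String getCachedMachineId() {
        return cachedMachineId; // may be null if init has not finished yet
    }
}
```

Callers must tolerate a `null` result while initialization is still in flight, e.g. by omitting the machine id from the log line.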
LGTM, Thanks

/azp run java - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).
Problem
Customers see noisy Netty WARN logs in HTTP/2 scenarios: `exceptionCaught()` events (e.g. `Connection reset by peer`) that reach the tail of the pipeline are logged at WARN by Netty's TailContext.
Root Cause
In HTTP/2, reactor-netty multiplexes streams on a shared parent TCP connection. The parent and child channels have different pipeline structures:
HTTP/1.1 pipeline (single channel — no leak to TailContext): exceptions are handled by reactor-netty's `ChannelOperationsHandler`.

HTTP/2 parent channel pipeline (BEFORE fix — leak to TailContext): `Http2FrameCodec` → `Http2MultiplexHandler` → no terminal exception handler, so TailContext logs WARN.

HTTP/2 parent channel pipeline (AFTER fix): `Http2FrameCodec` → `Http2MultiplexHandler` → `Http2ParentChannelExceptionHandler` consumes the exception.

HTTP/2 child stream channel pipeline (unchanged).
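The before/after difference can be modeled without Netty as a toy handler chain — purely illustrative Java, not the SDK's or Netty's code: an exception travels inbound through the handlers, and anything nothing consumes ends up at an implicit tail that logs WARN.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a Netty-style pipeline: if no handler consumes an exception,
// the implicit tail (TailContext) logs it at WARN.
public final class PipelineSketch {
    interface Handler { boolean onException(Throwable cause); } // true = consumed

    static String fireException(List<Handler> pipeline, Throwable cause) {
        for (Handler h : pipeline) {
            if (h.onException(cause)) {
                return "DEBUG";     // consumed before reaching the tail
            }
        }
        return "TAIL-WARN";         // reached the tail -> noisy WARN log
    }

    public static void main(String[] args) {
        List<Handler> before = new ArrayList<>();
        before.add(c -> false);     // models Http2FrameCodec: forwards exception
        before.add(c -> false);     // models Http2MultiplexHandler: forwards too
        System.out.println(fireException(before, new RuntimeException("reset")));

        List<Handler> after = new ArrayList<>(before);
        after.add(c -> true);       // terminal handler consumes the exception
        System.out.println(fireException(after, new RuntimeException("reset")));
    }
}
```

The fix corresponds to appending that last consuming handler to the parent channel's pipeline only; the child stream channels are untouched.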
Design: Connection-State-Based Log Level
The handler consumes `Exception` types on the parent channel (no exception type filtering within `Exception`). `Error` types (e.g., `OutOfMemoryError`) are not consumed — they propagate to TailContext for standard Netty handling.

The log level for `Exception` types is determined by connection state:
- DEBUG when `activeStreams == 0` OR `!channelActive`
- WARN otherwise, including when the stream count is unknown (`null`)

The active stream count is retrieved via `Http2FrameCodec.connection().numActiveStreams()` on the same parent channel pipeline. The lookup returns `null` (not `-1`) if the codec is unavailable, making the unknown-state case explicit. Failures to retrieve the stream count are logged at DEBUG.

Why no exception type filtering?
By the time any exception reaches our handler, all upstream handlers (`Http2FrameCodec`, `Http2MultiplexHandler`) have already handled the protocol actions (GOAWAY, stream reset, child channel error delivery). The exception reaching TailContext is an echo of already-handled work, regardless of type. Connection state (active streams + channel activity) is the only dimension that determines whether the exception has diagnostic value.

Why OR (not AND) for the DEBUG condition?
Either condition alone is sufficient:
- `activeStreams == 0` — no in-flight requests affected, regardless of channel state
- `!channelActive` — channel is already dead; any active streams will fail through their Reactor subscribers independently

Why no `ctx.close()`?

The handler does NOT close the channel. Connection lifecycle is owned by:
- the connection pool (reactor-netty) — evicts connections based on `!channel.isActive()`, GOAWAY, `maxIdleTime`, `maxLifeTime`

Our handler is the last in the pipeline with the least protocol context. Closing here would race with reactor-netty's pool management and could prematurely kill connections after non-fatal errors.
Why propagate `Error` but not `Exception`?

`Error` types (OOM, `StackOverflowError`) represent JVM-level failures that should not be silently consumed. Re-throwing `Exception` to TailContext provides no functional value — TailContext just logs WARN and swallows, which is what our handler already does with better context (connection state in the log message).

Testing
9 EmbeddedChannel unit tests with a production-matching pipeline (`Http2FrameCodec` → `Http2MultiplexHandler` → handler):
- `withoutHandler_exceptionReachesTail`
- `withHandler_zeroActiveStreams_consumedAtDebug`
- `withHandler_exceptionDoesNotCloseChannel`
- `withHandler_runtimeException_zeroActiveStreams_consumed`
- `withHandler_npe_zeroActiveStreams_consumed`
- `withHandler_activeStreams_consumedAtWarn`
- `withHandler_activeStreams_channelNotClosed`
- `withHandler_codecAbsent_fallsBackToWarnPath`
- `withHandler_errorNotConsumed_propagatesToTail`

WARN path log (`activeStreams=1`, `channelActive=true`)

Note: The `!channelActive` branch cannot be unit-tested with EmbeddedChannel because `disconnect()` tears down the pipeline before `fireExceptionCaught` can reach handlers. In production, `exceptionCaught()` fires while the channel is transitioning to inactive.

Impact
- Handler only implements `exceptionCaught()` — Netty's `@Skip` optimization bypasses it for all hot-path events
- `Error` types are propagated, not consumed